Number of hash functions (k) should be floored, not ceiled #45
When calculating the number of hash functions (k) for a Bloom filter, the result is a float, and developers have commonly rounded it up with ceil(). This is suboptimal: rounding down with floor() is the better choice.
Flooring the number of hash functions does not meaningfully degrade the false-positive rate: with just one function fewer, the error probability stays close to what ceil() gives. Degradation only becomes a concern when further hash functions are removed. In benchmarks, using floor() instead of ceil() improved processing time by about 10%. The gain comes from the reduced number of hash functions, which also lowers the filter's fill rate for the same total bit count. Efficiency improves without significantly compromising accuracy.
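To illustrate the trade-off, here is a minimal sketch using the standard approximations k = (m/n)·ln 2 and p ≈ (1 − e^(−kn/m))^k. The sizes `m` and `n` and all function names are assumptions chosen for illustration and are not taken from this repository's code.

```python
import math

def optimal_k(m: int, n: int) -> float:
    """Real-valued optimum number of hash functions for m bits and n items."""
    return (m / n) * math.log(2)

def fill_rate(m: int, n: int, k: int) -> float:
    """Approximate fraction of bits set after inserting n items with k hashes."""
    return 1.0 - math.exp(-k * n / m)

def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false-positive probability: all k probed bits happen to be set."""
    return fill_rate(m, n, k) ** k

# Hypothetical sizing: 10 bits per element.
m, n = 10_000, 1_000
k_real = optimal_k(m, n)
k_floor, k_ceil = math.floor(k_real), math.ceil(k_real)

for label, k in (("floor", k_floor), ("ceil", k_ceil)):
    print(f"{label}: k={k} "
          f"fill={fill_rate(m, n, k):.4f} "
          f"fpr={false_positive_rate(m, n, k):.5f}")
```

Under these assumed sizes, the floored k yields a lower fill rate and a false-positive rate within a fraction of a percentage point of the ceiled one. The formulas are approximations and say nothing about wall-clock time, which still has to be measured.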
Moreover, with fewer hash functions, fewer bits need to be checked for elements that are "probably in the set". For elements that are "definitely not in the set", the lower fill rate makes it more likely that an early probe hits an unset bit, so the lookup can return sooner, further boosting overall efficiency.
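The early-exit behaviour can be sketched as follows. This is an illustrative toy filter, not this project's implementation: the class name, the SHA-256 based double hashing, and the `probes` helper are all assumptions made for the example.

```python
import hashlib

class ToyBloomFilter:
    """Illustrative Bloom filter with k probes derived by double hashing."""

    def __init__(self, m: int, k: int):
        self.m = m                      # total number of bits
        self.k = k                      # number of hash functions (probes)
        self.bits = bytearray((m + 7) // 8)

    def _indexes(self, item: bytes):
        # Derive k bit positions lazily from two 64-bit halves of one digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def _is_set(self, idx: int) -> bool:
        return bool(self.bits[idx >> 3] & (1 << (idx & 7)))

    def add(self, item: bytes) -> None:
        for idx in self._indexes(item):
            self.bits[idx >> 3] |= 1 << (idx & 7)

    def probes(self, item: bytes) -> int:
        """Number of bits inspected before the lookup can stop."""
        count = 0
        for idx in self._indexes(item):
            count += 1
            if not self._is_set(idx):
                break                   # definitely not in the set: stop early
        return count

    def __contains__(self, item: bytes) -> bool:
        # all() short-circuits on the first unset bit.
        return all(self._is_set(idx) for idx in self._indexes(item))

bf = ToyBloomFilter(m=1024, k=6)
empty_probes = bf.probes(b"absent")    # empty filter: first probe already fails
bf.add(b"present")
member_probes = bf.probes(b"present")  # members always cost all k probes
```

A member always costs exactly k probes, so a smaller k directly shortens positive lookups. For non-members, the number of probes depends on how full the filter is, which is where the lower fill rate helps.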